Overview
We often collect data from two or more groups.
Group allocations are categorical variables, and are stored in a special way by R which makes displaying them on graphs easier. In this workshop we learn how to use categorical variables in boxplots, to colour scatterplots, and to make tables of descriptive statistics.
Techniques covered
- Making boxplots
- Adding color to a plot
- Using different types of data and visual scales
- Using group_by() to make a summary table and compare groups
Boxplots
- Boxplots are useful for comparing between categories
- Use ggplot; choose a categorical column as the x axis
- geom_boxplot() draws boxes and whiskers for each category
- The box describes the interquartile range
- The midpoint is the median
- Individual points show outliers in the data
iris %>%
ggplot(aes(x=Species, y=Petal.Length)) + # specify dataset columns
geom_boxplot() # add the boxes
Boxplots are useful for visualising differences between categories or groups.
Using ggplot, making a boxplot is similar to a scatterplot.
If we haven’t already done it, we’d need to load the tidyverse:
library(tidyverse)
To make the plot, we first choose the data we want to plot and add a pipe to the end of the line to send this data to the next command:
iris %>%
Next we use ggplot and aes(x = ..., y = ...) to select the columns we want to use in the data. aes is short for aesthetics which means ‘something you can see’. So the aes function defines what we will be able to see on the plot: the axes, and other features like color.
This time we have chosen Species for the x axis, because it is a categorical variable:
iris %>%
ggplot(aes(x = Species, y = Petal.Length))
[demonstrate using autocomplete in RStudio when writing the code above]
You’ll notice that I used the RStudio autocomplete feature to write function and column names from our dataset.
This makes things much easier — especially for column names. If you type column names by hand it’s easy to make spelling errors or typos, and this leads to errors in R.
If we run the code so far, ggplot draws the plot axes, but doesn’t add the data yet:
[run the code so far]
So far we have said what we want to show on our plot, but not how it should be shown.
To finish, we add geom_boxplot() which actually draws the boxes:
iris %>%
ggplot(aes(x = Species, y = Petal.Length)) +
geom_boxplot()
Note that I used a + symbol to add the boxplot layer to our graph.
So, now we have a boxplot, with one box drawn for each value of Species.
Interpreting boxplots
In a boxplot:
the thick line is the median or midpoint of the data
the height of the box indicates the interquartile range (IQR), so this contains 50% of the datapoints. A wider IQR indicates greater variation (spread) in a dataset.
the whiskers vary a little, depending on the software you use, but normally show the range/spread of the data. (The default in ggplot is to extend each whisker to the most extreme data point that is no more than 1.5 times the IQR above the top, or below the bottom, of the box.)
In this boxplot, any data point outside the range of the whiskers is described as an ‘outlying point’ or ‘outlier’ and is shown individually as a dot.
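The numbers behind each box can be checked directly. The following is a minimal sketch, assuming the tidyverse is installed; the summary column names (q1, med, lower_limit and so on) are made up for this illustration. It calculates the median, the quartiles, the IQR, and the furthest values the whiskers are allowed to reach for each species:

```r
library(tidyverse)

# Summarise the values a boxplot draws, one row per Species
box_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    q1 = quantile(Petal.Length, 0.25, names = FALSE),  # bottom of the box
    med = median(Petal.Length),                        # thick line
    q3 = quantile(Petal.Length, 0.75, names = FALSE),  # top of the box
    iqr = IQR(Petal.Length),                           # height of the box
    lower_limit = q1 - 1.5 * iqr,  # whiskers cannot extend below this
    upper_limit = q3 + 1.5 * iqr   # whiskers cannot extend above this
  )

box_stats
```

Any Petal.Length value below lower_limit or above upper_limit for its species would be drawn as an individual dot.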
Exercise 9
- Create a new chunk at the bottom of your worksheet
- Create a boxplot with Species on the x-axis and Sepal.Width on the y-axis (sepals are the leaves that encase an iris flower)
- Run the chunk
Your boxplot should look like this:
Using colour
- The points in a scatterplot can be coloured based on an extra variable.
- We can use a categorical or continuous variable for this (they will look different)
- This is done by adding the colour option to aes().
# colour each point; different colour for each Species
iris %>%
ggplot(aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point()
As we saw before, scatterplots show the relationship between two variables.
Using the mpg data, we could plot the size of car engines (displ, short for displacement) against their fuel efficiency on the highway in miles per gallon (hwy):
mpg %>%
ggplot(aes(displ, hwy)) +
geom_point()
[this first plot is not shown here to save space, but is shown in the video]
Colour can be added to distinguish the points — for example, what kind of car each point represents.
So far, you’ve used the aes() function to define which variable is plotted on the x and y axes. The aes() function is short for ‘aesthetics’ (what we can see) and connects columns in our data to visual aspects of a plot, like the x and y axes.
The colour=... option adds to this, and says which column is used to colour the points. In the mpg data, the drv column records if a car is front (f), rear (r) or four (4) wheel drive.
We can write color=drv to tell R to colour each point, depending on which wheels are driven:
# colour each point; different colour for each type of drivetrain
# front, rear and four wheel drive
mpg %>%
ggplot(aes(displ, hwy, color=drv)) +
geom_point()
This is an example of using a categorical variable to colour our points.
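For comparison, here is a sketch of colouring by a continuous variable instead. The choice of cty (city miles per gallon) is only for illustration, and p_continuous is a made-up name for the saved plot. Because cty is numeric, ggplot should use a gradient of colour with a colour bar in the legend, rather than one distinct colour per category:

```r
library(tidyverse)

# colour each point by a continuous (numeric) column:
# cty is fuel efficiency in the city, in miles per gallon
p_continuous <- mpg %>%
  ggplot(aes(displ, hwy, color = cty)) +
  geom_point()

p_continuous
```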
Exercise 10
- Create a new chunk at the bottom of your worksheet.
- Create a scatterplot using the gapminder dataset with gdpPercap on the x-axis, lifeExp on the y-axis, and continent in colour.
- Run the chunk.
Your plot should look like this:
Types of variables and visual scales
- Continuous, categorical and text (string) data are all common in psychology
- Internally, R stores data in a number of different data types.
- Sometimes we need to convert between data types
- For example, sometimes categorical data can get stored as numeric
- R uses data-types as a clue to set defaults (e.g. for graphs)
# see a list of columns in the dataset, and their types
iris %>% glimpse
# make a scatter plot with two continuous axes
# both wt and mpg are numeric variables
# so this works well
mtcars %>%
ggplot(aes(wt, mpg)) +
geom_point()
# try and make a boxplot with `am` as the x axis
# but because `am` is stored as a numeric variable and
# not categorical the scale of the x axis is wrong
mtcars %>%
ggplot(aes(am, mpg)) +
geom_boxplot()
# re-draw the plot, but converting `am` to a factor/categorical
# variable first. Now the x axis looks correct
mtcars %>%
ggplot(aes(factor(am), mpg)) +
geom_boxplot()
The video introduces the link between different types of data (e.g. continuous, categorical, text), the way R stores them, and the way that ggplot presents them on the scales of a plot.
These might seem like small details — but having some understanding of these details will help later on.
Columns vs variables
The word variable gets used in at least 4 different ways in quantitative research, and this confuses a lot of people. We can’t always avoid this ambiguity because the usages are so common in the field, but it helps to know about them in advance:
variables in a theoretical model. That is, things we think exist, and which cause other things to happen. An example would be “empathy”, or “working memory”.
variables in a study design. For example an experimental group allocation, or an attribute of participants like age or gender.
variables in a dataset. Here we mean a column of numbers in a spreadsheet, where the column has a name. This usage overlaps with the previous two, but doesn’t have to. For example, we might have a column of numbers recording participants’ scores on an empathy questionnaire. But we might also have a column containing reaction times for multiple trials; analysed a certain way these reaction times might tell us something about working memory, but the column of reaction times isn’t the same thing as the variable in our theoretical model.
variables in R: these are a general purpose container which can store anything. Variables can contain columns of numbers, but they can also contain whole datasets, or graphs, or the results of statistical tests. A variable in R often isn’t the same thing as an experimental variable.
In this guide we use variable to mean either an experimental/theoretical variable, or an R container. When we mean a column of data in a dataset we will use the word column instead.
Types of variable
If we are thinking about variables in our study designs, the main types are:
- categorical or nominal variables (these are sometimes also called factors in experimental design)
- binary variables: these are either true or false, and are a special kind of categorical data
- interval or continuous variables: e.g. heights, weights, or reaction times
- ordinal variables: e.g. a response to a Likert-style question, on a 1-7 scale
- count variables: how often something has occurred, measured in whole numbers greater than or equal to zero
Types of column
Datasets have multiple columns, each with a unique name.
There are different types of column in R. These data-types are mostly determined by the sort of variable. But there is some overlap, and the same variable could be stored in different types of column. We sometimes also need to convert between data types.
There are four data-types you need to know about in R:
- Factors, which are used for categorical data.
- Text, also called character or string data.
- Numeric columns, which can be either integer or double. Double is computer speak for ‘double-precision number’, which just means decimals and really big numbers are allowed.
- Logical, which can be used to store values that can only be either TRUE or FALSE.
You can see which data-type is used to store a variable when using the glimpse command you saw in session 1:
iris %>% glimpse
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
In the glimpse output you can see the variable names listed on the left, followed by grey text surrounded by angle brackets, e.g. <dbl>, which is the abbreviated data-type.
In this built-in dataset, most of the data is numeric (dbl), but the Species variable is categorical, and stored as a factor (fct).
It’s possible to convert columns from one type to another, to suit our needs, and we’ll see more of this later.
| Data/column type | Useful for | Abbreviations used/subtypes | Often need to convert from |
|---|---|---|---|
| Numeric | Continuous, ordinal | int, dbl | Categorical; Text |
| Factor | Categorical, ordinal | fct, ord | Text |
| Text | Free text, categorical | chr | Categorical |
| Logical | Binary/boolean | lgl | Numeric |
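As a rough sketch of what converting between these types can look like, the example below uses only base R functions. The vectors (group_text, scores and so on) are made up for illustration and do not come from any of the datasets in this workshop:

```r
# A small made-up example vector (not from any dataset)
group_text <- c("1", "2", "2", "1")     # text (chr)

group_factor <- factor(group_text)      # text -> factor (fct)
group_number <- as.numeric(group_text)  # text -> numeric (dbl)
is_group_two <- group_number == 2       # numeric comparison -> logical (lgl)

# Converting a factor straight to numeric returns the internal
# level codes, so go via text to recover the original values
scores <- factor(c("10", "20", "10"))
as.numeric(scores)                 # level codes: 1 2 1
as.numeric(as.character(scores))   # original values: 10 20 10

# check the type of each new vector
class(group_factor)
class(group_number)
class(is_group_two)
```

The factor-to-numeric step is a common source of mistakes, which is why the sketch converts to text first.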
Column types and scales on graphs
If we look at the mtcars data we can see that all the columns are stored as numeric data (dbl):
mtcars %>% glimpse()
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…
This is fine if we want to make a scatter plot (here of ‘miles per gallon’ vs the weight of the car):
mtcars %>%
ggplot(aes(wt, mpg)) +
geom_point()
In this plot both the x and y axes are continuous. That is, they are numeric variables, using real numbers.
However, if we want to make a boxplot of the mpg column using am as the x axis then we have a problem:
mtcars %>%
ggplot(aes(am, mpg)) +
geom_boxplot()
Warning: Continuous x aesthetic -- did you forget aes(group=...)?
We might have expected to see:
- miles per gallon on the y axis
- two separate boxes, one for automatic cars and another for manual.
This doesn’t work as expected though.
R spots that am is stored as numeric data, so it creates a continuous scale on the x axis. Only one box is drawn at the midpoint of all the values of am; because am ranges from 0 to 1 the box appears at 0.5.
We want to use am as a categorical variable, so we should convert it to a factor. Then our plot will work properly.
We can use the command factor(am) to tell R that the x-axis is a factor:
mtcars %>%
ggplot(aes(factor(am), mpg)) +
geom_boxplot()
This gives us the boxplot we were expecting. The only change was to replace am with factor(am). This tells R to convert the variable am to a factor, and ggplot can guess the type of x axis correctly.
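With factor(am) the x axis is labelled 0 and 1. One possible refinement, sketched below, is to attach readable labels when converting. This assumes the coding given in the mtcars help page (0 = automatic, 1 = manual); the names mtcars_labelled and transmission are made up for this example, and mutate() is a tidyverse function not otherwise covered in this session:

```r
library(tidyverse)

# Convert am to a factor with readable labels before plotting.
# In mtcars, am is coded 0 = automatic, 1 = manual.
# The new column name `transmission` is made up for this example.
mtcars_labelled <- mtcars %>%
  mutate(transmission = factor(am, levels = c(0, 1),
                               labels = c("automatic", "manual")))

mtcars_labelled %>%
  ggplot(aes(transmission, mpg)) +
  geom_boxplot()
```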
Work with a friend: Describe the 4 ways in which quantitative researchers might use the word ‘variable’.
(If you need to look these up from the video or text above then try testing yourself again after completing the other exercises.)
Exercise XXX
Use glimpse to check the data types of the mpg and the diamonds datasets.
- The 4th variable in the mpg data is a ____
- The 5th variable in the diamonds data is a ____
Exercise XXX
Use the mpg dataset to make a boxplot showing miles per gallon on the y axis, and number of cylinders on the x axis (cyl). Your plot should look like this:
Grouping with group_by
- Datasets often contain categorical variables
- We often want to compare statistics (like averages) between categories
- The group_by function is a quick way to combine filtering and summarising
- group_by creates a grouped dataframe
- Adding group_by() to a pipeline runs the subsequent steps once for each group.
- The result is always a new dataframe
# look at the variables in the CO2 data
# which is about different grasses' use of CO2 (uptake)
# under cold/normal conditions
CO2 %>% glimpse
# make a plot comparing CO2 uptake of types of grass
CO2 %>%
ggplot(aes(Type, uptake)) +
geom_boxplot()
# using filter with summarise to get the numbers from the graph
CO2 %>%
filter(Type=="Quebec") %>%
summarise(mean(uptake))
# (but this could get repetitive)
# instead we use group_by to make multiple summaries,
# one for each category
CO2 %>%
group_by(Type) %>%
summarise(mean(uptake))
# group by more than one variable at once
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake))
# calculate the mean and standard deviation in a single step
# (this adds two columns to our output for mean and SD)
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake), sd(uptake))
# give the new column a name of our choosing
CO2 %>%
summarise(average_uptake = mean(uptake))
In this video we’ll use a dataset about plants to show how to create summary statistics for multiple groups.
Plants photosynthesise by combining sunlight with carbon dioxide (CO2) to make sugars. They do this more when it’s warmer. The CO2 dataset records how much carbon dioxide plants used (uptake) when they were chilled or not-chilled. The type of plant (Type) and the conditions (chilled/unchilled) are both stored as factors:
CO2 %>% glimpse
Rows: 84
Columns: 5
$ Plant <ord> Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn2, Qn2, Qn2, Qn2, Qn2,…
$ Type <fct> Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Que…
$ Treatment <fct> nonchilled, nonchilled, nonchilled, nonchilled, nonchilled,…
$ conc <dbl> 95, 175, 250, 350, 500, 675, 1000, 95, 175, 250, 350, 500, …
$ uptake <dbl> 16.0, 30.4, 34.8, 37.2, 35.3, 39.2, 39.7, 13.6, 27.3, 37.1,…
We might make a plot like this to compare the types of plant:
CO2 %>%
ggplot(aes(Type, uptake)) +
geom_boxplot()
The graph is helpful, but what if we want the actual numbers in a table, or to report?
Using filter and summarise
One option would be to filter our data first and then summarise:
CO2 %>%
filter(Type=="Quebec") %>%
summarise(mean(uptake))
mean(uptake)
1 33.54286
We could then repeat that step for each type of grass:
CO2 %>%
filter(Type=="Mississippi") %>%
summarise(mean(uptake))
mean(uptake)
1 20.88333
That would be repetitive though, and might get frustrating if there are lots of categories.
Using group_by
Instead of using filter, we can use group_by to split our dataset into multiple groups, summarising each one separately:
CO2 %>%
group_by(Type) %>%
summarise(mean(uptake))
# A tibble: 2 x 2
Type `mean(uptake)`
<fct> <dbl>
1 Quebec 33.5
2 Mississippi 20.9
Nested groups
Another factor in this dataset is an experimental treatment: whether the grasses were chilled or nonchilled.
We can group by two factors at once and get a row for each combination:
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake))
# A tibble: 4 x 3
# Groups: Type [2]
Type Treatment `mean(uptake)`
<fct> <fct> <dbl>
1 Quebec nonchilled 35.3
2 Quebec chilled 31.8
3 Mississippi nonchilled 26.0
4 Mississippi chilled 15.8
Multiple statistics
Finally, we can calculate multiple statistics at once for each of the groups:
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake), sd(uptake))
# A tibble: 4 x 4
# Groups: Type [2]
Type Treatment `mean(uptake)` `sd(uptake)`
<fct> <fct> <dbl> <dbl>
1 Quebec nonchilled 35.3 9.60
2 Quebec chilled 31.8 9.64
3 Mississippi nonchilled 26.0 7.40
4 Mississippi chilled 15.8 4.06
Give the new variables a name
R gives the new columns a name based on the function we use to summarise.
For example, if we use mean on the uptake variable then the new column is called mean(uptake) (see above).
When we use summarise, we can give the new column a specific name like this:
CO2 %>%
summarise(average_uptake = mean(uptake))
average_uptake
1 27.2131
The new name shouldn’t include spaces or other ‘special’ characters.
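Naming works the same way when the data are grouped. The sketch below combines the steps from this section: it groups by two factors and gives each summary column a name. The names uptake_summary, sd_uptake and n_measurements are made up for illustration; n() and the .groups option are dplyr features that are not otherwise covered here:

```r
library(tidyverse)

# Named summary columns, calculated separately for each group
uptake_summary <- CO2 %>%
  group_by(Type, Treatment) %>%
  summarise(
    average_uptake = mean(uptake),
    sd_uptake = sd(uptake),
    n_measurements = n(),      # number of rows in each group
    .groups = "drop"           # return an ungrouped table
  )

uptake_summary
```

Saving the result with <- keeps the summary table as an R variable, so it can be reused later without recalculating it.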
Exercise XXX
Adapt the code below to calculate the median uptake for each grass in the CO2 data, when chilled and not-chilled:
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake))
Exercise XXX
Use the built-in iris dataset
Use group_by and summarise to calculate the average Sepal.Length of each Species of flower.
Exercise XXX
chickwts contains data for the weights of chicks (in grams) fed on different diets.
glimpse(chickwts)
Rows: 71
Columns: 2
$ weight <dbl> 179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309, 229, 18…
$ feed <fct> horsebean, horsebean, horsebean, horsebean, horsebean, horsebe…
Calculate the mean and standard deviation of chick weights for each type of feed.
The mean weight of chicks fed on linseed was ____ g (to 2 decimal places).
The standard deviation of the weights of chicks fed on sunflower was ____ g (to 2 decimal places).
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers are revealed in Session 4.
- Which function makes a boxplot?
- What is the difference between a dbl and a fct or ord?
- Give an example of when the difference between dbl and fct matters when making a plot. (Include code examples for this if you can.)
- How can you convert a variable from a dbl to a fct?
- How could you calculate the mean for one level of a factor?
- How would you calculate the mean for all levels of a factor?
Extension exercises
Extension exercise XXX
Make a scatterplot of the diamonds data. Show carat on the x-axis, price on the y-axis and the clarity of the diamond in colour. Try to produce your plot before comparing it against the answer.
Extension exercise XXX
Make a scatterplot of the mpg data. Show city mpg on the x-axis, highway mpg on the y-axis and the vehicle class in colour. Try to produce your plot before comparing it against the answer.
Extension exercise 1
Make a boxplot showing life expectancy by continent for years greater than 1999. (Hint: use filter(), ggplot() and geom_boxplot().)
The plot should look like this:
Extension exercise XXX
This boxplot uses the gapminder dataset to show lifeExp (life expectancy) on the y-axis for each continent on the x-axis.
In a new chunk, write the R code to produce this plot.
Extension exercise XXX
Create a boxplot which shows drivetrain on the x-axis and miles per gallon when a car is driven in the city on the y-axis. Your plot should look like this:
Extension exercise XXX
Try to recreate the plot below. Remember to use factor to convert the type of the column.